A theory-based practical solution to correct for sex-differential participation bias

Most genomic cohorts are retrospective where the exposures and outcomes are predetermined prior to sample collection. Therefore, a spurious association between an exposure and an outcome can arise if both variables affect study participation. Such concerns were raised in previous studies questioning the representativeness of the UK Biobank. Recently, a genome-wide association study (GWAS) on biological sex found many autosomal hits and non-negligible autosomal heritability which the authors attribute to selection bias. In this study, we propose a simple and a practical method that can overcome sex-driven selection bias based on theoretical analysis and simulations. Supplementary Information The online version contains supplementary material available at 10.1186/s13059-022-02703-0.


Introduction
This section introduces the basic concepts that are required to understand the current study. C is said to be a collider of A and B if it is a common effect of A and B. As we are particularly interested in a case where A and B are marginally independent, the relationship between A, B and C is summarzed in the following directed acyclic graph.

A B C
The particular variables of interest in this study are biological sex (defined by the sex chromosome configuration), genetic variants and study participation. We write Z ∈ {0, 1} for sex, G j ∈ {0, 1, 2} for the j-th genotype and C ∈ {0, 1} for study participation. Let G = {G j |j = 1, ...}.
This translates the diagram above into the following.

Z G C
Although Z and G are marginally independent (Z ⊥ ⊥ G), they may be conditionally dependent when conditioned on C, i.e. Z ⊥ ⊥ G|C [1]. This conditioning is represented as a box surrounding C. This spurious dependence between Z and G conditional on C is called the collider bias.
As we only observe samples who participated in a study, the diagram above predicts that spurious dependence between Z and G may be observed in large genetic cohorts. Indeed, Pirastu et al. reported that such bias is present in the 23andMe and the UK Biobank datasets [2]. Additional components, for example, such as an unobserved confounder U of C and G can be included.
This also opens a backdoor between Z and G through the path G ← U → C ← Z conditional on C. However, this backdoor is closed regardless of conditioning on U if one does not condition on C. Hence, U will not play a significant role in the remaining arguments in this study.
2 Three models of collider bias Pirastu et al. proposed three models that can potentially explain the association between sex and autosomal variants [2]. For completeness, we restate the three models in this section. The figures below were adopted from Pirastu et al. with some changes in the notation.
Model A depicts the scenario where sex Z and a genetic variant G affects Y , where Y is a phenotype that has a causal effect on the participation C. Because C is a descendant of Y which is a collider of Z and G, conditioning on C will induce a collider bias between Z and G [1]. Model B is similar to Model A but Z influences C directly rather than indirectly through Y . Pirastu et al. used the same functional relationship of Z and G on C to depict both Model A and Model B. Specifically, for both models, they assumed the logit of the probability of C = 1 as a linear function of Z and G. It can be found in their supplementary material that for Model B after dropping the coefficients of the predictors for notational convenience. Therefore, the association between Z and G due to the collider bias is essentially identical in both models.
In the scenario of Model C, Z alters the effect of Y on C. For example, the effect of trait Y on C can be positive in men but negative in women. This model requires a more thorough inspection. Importantly, the third figure depicting Model C is not a standard DAG. In a standard DAG, an arrow only enters and exits a node but not another arrow. However, in Model C, the arrow exiting Z enters the arrow from Y to C. This choice by Pirastu et al. comes from the fact that the standard DAG cannot explicitly express interaction between variables. A conventional way to express interaction will be using a standard DAG which is identical to Model B but stating a explicit functional form of C with G and Z as a predictor. In their supplementary material, they used a constant K to depict the modifying effect of Z on Y 's effect as In the following chapters, we develop a general model that encompasses all three models as a special case. We will use standard DAGs to depict the situation and will attach a functional form for each variable if necessary to avoid the use of a non-standard DAG.
Finally, at this point, one might have noticed that Model A differs from Model B in that sex Z affects trait Y . For example, if Model A is true, the prevalence of trait Y can differ depending on sex, which is not the case in Model B. Although this is not an issue for the following two chapters, we will address this problem in the last section when it comes to effect.

An analytic expression for collider bias
In this section, we assume that study participation takes place according to the following logit model: is a binomial distribution with n trials with probability p, G j is the standardized genotype of the j-th variant, Z is the sex, and is the residual error following a normal distribution with mean zero and variance σ 2 . β are the corresponding regression coefficients; β g→c is the effect of G j toward participation, and β z→c is the effect of Z toward participation. This simple model does not assume the sex-gene interaction (sex-differential participation), but we will later extend this model to include interaction. Let β g→z be the effect of G j on Z caused by the collider bias. A tilde on a symbol, e.g. β, stands for the biased parameter that is estimated under non-random selection into the study. Let π c|z = P(C = c|Z = z) be the sex dependent participation probability. Similarly, let π c|gz = P(C = c|G = g, Z = z) be the sex dependent participation probability conditioned on genotype. Then, the following lemma holds exp β g→z = π 1|(g+1)1 π 1|g0 π 1|g1 π 1|(g+1)0 Proof of Lemma 1.
Proof. By definition, the effect size under the logit model can be represented as an odds ratio such that exp β g→z = P(Z = 1|G = g + 1, C = 1)/P(Z = 1|G = g, C = 1) P(Z = 0|G = g + 1, C = 1)/P(Z = 0|G = g, C = 1) which holds for any value of g. Note that we used g + 1 to present a genotype change by a single standardized unit. Since the following equation holds, P(Z = z|G = g, C = c) = P(Z = z, G = g, C = c) P(G = g, C = c) we can substitute the elements on the right-hand side of the equation 1 with equation 2. Then, given the independence between G and Z (P (G = g, Z = z) = P (G = g)P (Z = z)), all terms other than π c|gz cancel out and the equation is simplified to the desired result.
Based on this lemma, we can obtain the following theorem that shows that the effect of genetic participation (β g→c ) can be obtained without observing samples of C = 0 (samples who did not participate) under the assumption of no sex-gene interaction. Theorem 1. Under the logit model described above, the following holds: Proof of Theorem 1.
Proof. The proof is based on Nguyen et al. with slight modification [3]. We assume that the effect of variant g on C is small, i.e. β g→c is small. For notational convenience, let us define The participation probability π 1|gz can be represented as the expected value of participation (because we defined participation as 0 and 1), which has no closed form expression. However, this expectation (or integration, equivalently) can be approximated to the following [4]: This approximation holds because the Gauss error function (erf) approximates the logistic function.
The theorem above has the following implications. The theorem provides an analytic relationship between the spurious association between sex and autosomal variants, and the contribution of variant g on study participation. The most important implication of the theorem is that although β g→c can not be directly obtained from the data since samples with C = 0 are never observed, it can be inferred using only the sampled data by multiplying a constant factor (π c|1 − π c|0 ) −1 to β g→z which is possible to estimate.
The second implication is that it provides a formal justification behind the direction of the collider bias given the direction of the correlation between the causes and the collider. Griffith et al. used an example where the admission into prestigious schools is affected by both sporting ability and academic ability [5]. Based on a graphical illustration, they argued that a negative association between sporting ability and academic ability is induced by conditioning on admission, since both factors positively affect admission into prestigious schools. We present the first formal proof for the same assertion under a GWAS context. If Z and G both increase the probability of study participation, i.e. π c|1 > π c|0 and β g→c > 0, the theorem clearly shows that the direction of the bias will be negative as asserted by Griffith et al. This however, as pointed out by Shahar and Shahar, may not strictly hold because the particular direction of the collider bias can depend on multiple factors other than the sign of the effect of exposures [6].

The sex-gene interaction model
Although Theorem 1 shows a simple relationship between β g→c and β g→z , this simple model and relationship do not sufficiently explain the observed data. Note that Theorem 1 holds for all variants, and thus can be applied to heritability. That is, we can expect the following relationship to hold under this simple model: Z was approximately 2.3%. However, the estimate of between-sex study participation difference π 1|1 − π 1|0 in the UK Biobank was around 1.3% [7,8]. Therefore, even if h 2 C has its maximum value of 1.0 (because heritability is 1 at its maximum), the equation can not hold.
Pirastu et al. reported a similar conclusion that the simple model without interaction cannot explain the observed data, although their argument was based on simulations rather than analytical formulation. Based on extensive simulations, they showed that if the effect of a heritable trait on study participation depends on sex (sex-gene interaction), then the collider bias can explain the observed heritability of sex in the UK Biobank [2]. In other words, by simulations, they found that a sex-gene interaction is a core component that can explain the observed association between autosomal variants and sex.
To incorporate sex-gene interaction, we add an interaction term to our previous model.
The following theorem provides a theoretical and analytical foundation of what Piratsu et al. found by simulations.
Theorem 2. Under the sex-gene interaction model, the following holds: Proof of Theorem 2.
Proof. The proof is nearly identical to the proof of Theorem 1. The difference comes from the following modification to π 1|gz by adding a sex-gene interaction term.
Then by a similar argument we have the following: . From Lemma 1, if σ 2 is small, γ approximates to 1, which completes the proof.
The theorem has the following interpretation. As the current participation rate of genomic cohorts (such as the UKB) respect to the total population is fairly around 5%, assuming that β g→c and β zg→c has similar orders of magnitude, Theorem 2 shows that the latter term β zg→c (1 − π 1|1 ) is the dominant factor driving the association between autosomal variants and sex. Thus, it is not surprising that Piratsu et al. were able to explain the observed sex-gene association by the simulations incorporating the interaction.
This theorem not only explains the observation of Pirastu et al. but also allows several important predictions. As the sample size grows, the former term β g→c (π 1|1 − π 1|0 ) can become important in future studies. This is because of the following reason. The increased participation π can be interpreted as the increase in the intercept β 0 , given that all other β were unchanged. However, since this is a logistic model, the difference π 1|1 − π 1|0 will gradually increase as the participation rate increases to around 0.5 (due to the slope change in the logistic curve). Thus, the first term will become larger. Conversely, the second term that is expected to drive sexautosome association at the present state will become less important as 1 − π 1|1 becomes smaller. Consequently, the nature of collider bias will change gradually as the sample size grows in the future.
The formula can also be used to predict genetic correlation of sex and other traits. From Theorem 2, we have cov(β g→y , β g→z ) = −cov(β g→y , β g→c (π 1|1 − π 1|0 )) + cov(β g→y , β zg→c (1 − π 1|1 )) where the two correlation terms correspond to the genetic correlation of Y and C for the main term and the interaction term respectively. However, the two correlations cannot be recovered from the data because we cannot obtain the summary statistics of C as C = 0 is not observed.
For completeness, we state the following and end this section. Proof. Simplifying the logit model (without interaction) in Section 3 by setting the intercept β 0 to zero leads us to Model A and Model B. Given the logit model with interaction in Section 4, setting the intercept to zero and setting β zg→c to (K − 1)β g→c leads us to Model C. (Recall that, we defined K as a constant depicting modifying effect of Z on Y in Section 2.) Combining Theorem 1, 2 and 3 proves why Model C could explain the observed autosomal heritability of sex under realistic parameters but Model A and B could not. The following section discusses the effect of sex-independent and differential participation on the genotype-trait associations of other traits rather than sex.
5 The effect of sex-differential participation on the genotype-trait associations of other traits We first introduce notations. Let Y be the trait of interest, which has causal effect on participation. β Y =y,Z=z g→c is the effect of variant j on participation conditioned on phenotype Y = y and sex Z = z. π c|gyz = P (C = c|G j = g, Y = y, Z = z) is the participation probability conditioned on G j = g, Y = y, Z = z. Occasionally, we will omit the first subscript in π c|gyz and write π c|yz since the effect of a genetic variant is negligible according to the infinitesimal model. Then we can assume a logit model with sex-specific effects of G j and Y on C.
log π 1 − π = β 0 + β Y =y,Z=z g→c G j + β Z=z y→c Y + β z→c Z + Based on this notation, the following DAG depicts the sex-differential bias.
Under this model, the collider bias has an analytic expression: Theorem 4. Under sex-differential participation, the log odds ratio of G j on Y conditional on C = 1 is Then the log odds ratio under selection is β g→y = β g→y + log π 1|(g+1)1 π 1|g0 π 1|(g+1)0 π 1|g1 Lemma 3 (Multivariate mean-value theorem). Let A be open in R n . Suppose that partial derivatives ∂f ∂xi exist for f : A → R and are continuous on A. Then there exist q i ∈ (a i , h i ) for a, h ∈ A where the following holds Proof of Lemma 2.
Proof. We start from the last line of the proof of Lemma 1 (that is, the final form of equation (2)). Now, Y can be any phenotype and not necessarily biological sex. Thus, we can replace Z with Y , as follows.
While Lemma 1 showed the analytic form of the bias when the trait of interest was sex, for which we know there is no association to G, Lemma 2 shows the analytic form of the bias for any trait that may have association to G. As we can see, the bias form is similar to Lemma 1 but is presented as an additional term to the original association that would be observed under no selection. We note that the proofs of Lemma 1 and Lemma 2 were adopted from Nguyen et al. with some simplifications. We refer to the reference for a more detailed proof [3].

Proof of Lemma 3.
Proof. The proof of the theorem can be found in Theorem 6.2. of [9].
Let δ Y =y,Z=z (g) = β Y =y,Z=z g→c g, then π 1|gy can be treated as a two component function π 1|gy : R 2 → R respect to (δ Y =y,Z=0 , δ Y =y,Z=1 ). Applying two-variable mean-value theorem (Lemma 3) gives the following approximation: The partial derivative ∂ log π 1|gy ∂δ Y =y,Z=z can be calculated using the chain rule: Substituting this result to the second term of Lemma 2 finishes the proof.
Theorem 4 can be further refined to several simpler forms if we impose additional assumptions. We first present the results for the models suggested in Pirastu et al. [2]. In model A, the bias is the same to the sex case (Theorem 1) as Z has no role in the collider bias between G j and Y as shown in the following figure.
Model B and C share the same DAG but β Y =y,Z=z g→c is constant in respect of Y and Z in Model B. This leads to the following result.

Corollary 2 (Model B).
In Model C, β Y =y,Z=z g→c is constant only in respect of Y but not Z. The constant K, as in Pirastu et al. [2], is the ratio between β Z=1 g→c and β Z=0 g→c (β Z=1 g→c = Kβ Z=0 g→c ). Corollary 3 (Model C).
β Z=z g→c These formulas can be further simplified under the rare outcome assumption. When the participation rate is low, the logit function log x 1−x is approximately log x. The ratios can be expressed soley by β variables.
6 Sex-stratified analysis can substantially reduce the bias caused by sex-differential participation on the associations of other traits So far, we have derived the analytical formula for the bias caused by sex-differential participation on the associations of other traits. The formula under low participation rate is particularly of interest, since the current genetic cohorts only recruit a small portion of the target population. Importantly, our formulas can suggest a very simple solution to correct for the bias. Corollary 4 first shows that under low participation rate, the magnitude of bias can be large if model C is true. It is obvious that π 1|y1 −π 1|y0 is negligible so under Model B, as in the sex GWAS case, the bias is negligible. On the other hand, Model C can cause meaningful shift in the estimates with fairly realistic parameters. To demonstrate this point, we consider a situation in model C with K = −1. This means β Z=1 g→c = −1 · β Z=0 g→c and β Z=1 y→c = −1 · β Z=0 y→c . Then Corollary 4 is reduced to the following β g→y + e βy→c − e −βy→c+βz→c e βy→c + e −βy→c+βz→c − 1 − e βz→c 1 + e βz→c β g→c Here, we dropped the superscript for sex Z where β Z=0 We calculated the coefficient of the bias term for a range of values of β y→c and β z→c . This is presented in the following The values indicate that even with modest values of OR Y = exp β y→c and OR Z = exp β z→c , the magnitude of the bias becomes comparable to the true effect size. This effect is, however, negligible when the GWAS is performed in a sex-stratified manner. The bias term of the sex-stratified analysis can be easily obtained from Theorem 4. The rationale is that, for each single-sex analysis (for men and women each), the situation is identical to the case where the likelihood of participation is equal in men and women. Thus, when obtaining the bias for sex Z = z, replacing the superscripts of the coefficients of β Z=1−z • to β Z=z • will reduce the bias term to a very simple form: −β Z=z g→c (π 1|1z − π 1|0z ). Writing this down, β Z=z g→c (z = z since only one sex in the analysis) As this bias is negligible as in Corollary 1, single sex analysis is almost immune to the bias from sex-differential participation under the low participation rate situation. Thus, one can substantially reduce the bias by analyzing each sex separately and meta-analyzing the results, which is a surprisingly simple solution for this complicated problem. Note that Theorem 4 is general enough to include the scenario that the effect of variant G j on participation C is pleiotropic (the effect is mediated by more than one trait). The theorem predicts the bias when running a GWAS on particular trait Y . Then the effect of variant G j on C that is mediated by Y is depicted in G j → Y → C and the rest of the effect that is not mediated by Y (but possibly by traits other than Y ) is depicted in G j → C. If the effect of G j on C was totally by Y , only G j → Y → C would have been sufficient.
The earlier table demonstrates the degree of bias under the presence of sex imbalance. In our model, seximbalance is largely driven by OR Z = exp β z→c which is the odds ratio of a particular sex (e.g. woman) to participate in the study. An interesting observation is that the magnitude of the bias is not simply proportional to the degree of sex imbalance. For example, in the first row of the table where OR Y < 1, the more likely a woman to participate in a study (larger OR Z ), the smaller the bias becomes. On the other hand, in the last row of the table where OR Z > 1, the more likely a woman to participate in a study, larger the bias becomes.
In cases where participation rate is high, we can not use the approximation in Corollary 4 but it is possible to obtain a similar table of coefficients of the bias term using Theorem 4. Again, there is no simple trend in the magnitude of the bias with respect to the degree of sex imbalance. However, the overall magnitude of the bias is much smaller when the participation rate is high, as we can see on the table below where we assumed β 0 = 0 which corresponds to approximately 50% overall participation.  The ldscsim function can simulate trait with differing heritability. The generated trait were standardized to mean 0 and variance 1 by default. Binary traits were generated by treating the continuous trait simulated by ldscsim as the underlying liability. Sex was assigned randomly to individuals with probability 0.5. Specific simulation parameters are listed in the corresponding subsections. Study participation was simulated using the same model in Section 4. Total 50,000 individuals and 1,000 markers were simulated by the balding_nicholas_model function .

Robustness of sex-stratified GWAS
Study participation C was generated to have heritability h 2 = 0.2 in both sex. The odds ratio of trait y on study participation was set to |β Z=z y→c | = 0.5. The sign of the parameter was 1 in female and 0 in male. Trait y was generated with h 2 = 0. Participation rate was controlled by changing β 0 from -3 to 3. GWAS was performed using Hail with the combined simulated data and the sex-stratified simulated data. Fixedeffects meta-analysis estimates were produced by combining male and female estimates by inverse variance weighting. In the supplementary figures, the simulations were done with β 0 ∈ {−4, −3, −2, . . . , 4}, |β Z=z y→c | = 0.5, h 2 C ∈ {0, 0.1, 0.2, 0.3} and r g ∈ {−1, 0, 1}.

Bias in MR-IVW versus study participation rate experiment
Study participation C was generated to have heritability h 2 = 0.3 in both sex. Binary trait y 1 and y 2 were generated to have heritability h 2 = 0. Participation rate was controlled by changing β 0 from -3 to 3. GWAS was performed for y 1 and y 2 with the combined simulated data and the sex-stratified simulated data. MR-IVW estimates were obtained using the summary statistics generated from the previous step. The IVW method used in the study was adopted from Bowden et al. [10].

β difference between sex and sex GWAS
The following parameter configuration was used.